Trigram morphosyntactic tagger for Polish

Author

  • Lukasz Debowski
Abstract

We introduce an implementation of a plain trigram part-of-speech tagger which appears to work well on Polish texts. At present the tagger achieves a 9.4% error rate, which makes it significantly better than our previous stochastic disambiguator. Since the trigram model for Polish behaves similarly to the one for Czech, we hope to reach the Czech state-of-the-art error rate when the quality of the training data improves.
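For readers unfamiliar with the technique, the following is a minimal sketch of what a plain trigram tagger does: it scores each tag by the two preceding tags and by the word it emits, then picks the best tag sequence with Viterbi search. This is not the author's implementation. The function names (`train`, `viterbi`), the `<s>` start symbol, the toy English data, and the add-one smoothing are all assumptions made for illustration; the abstract does not say how the actual tagger smooths its counts or handles unknown words.

```python
from collections import defaultdict
import math

START = "<s>"  # hypothetical padding tag for sentence starts


def train(tagged_sentences):
    """Count tag trigrams, their bigram contexts, and (tag, word) emissions."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    emit = defaultdict(int)
    tag_count = defaultdict(int)
    vocab = set()
    for sent in tagged_sentences:
        t1, t2 = START, START
        for word, tag in sent:
            tri[(t1, t2, tag)] += 1
            bi[(t1, t2)] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
            t1, t2 = t2, tag
    return tri, bi, emit, tag_count, vocab


def viterbi(words, model):
    """Return the most probable tag sequence under the trigram model."""
    tri, bi, emit, tag_count, vocab = model
    tags = list(tag_count)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unknown words

    # Add-one smoothing (an assumption, chosen only for simplicity).
    def log_trans(t1, t2, t3):
        return math.log((tri[(t1, t2, t3)] + 1) / (bi[(t1, t2)] + n_tags))

    def log_emit(tag, word):
        return math.log((emit[(tag, word)] + 1) / (tag_count[tag] + n_words))

    # best maps the last two tags to (log-probability, tag path so far).
    best = {(START, START): (0.0, [])}
    for word in words:
        new_best = {}
        for (t1, t2), (score, path) in best.items():
            for t3 in tags:
                s = score + log_trans(t1, t2, t3) + log_emit(t3, word)
                key = (t2, t3)
                if key not in new_best or s > new_best[key][0]:
                    new_best[key] = (s, path + [t3])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy training data, invented for this example.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
toy_model = train(toy_corpus)
print(viterbi(["the", "dog", "barks"], toy_model))
```

A morphosyntactic tagger for Polish would use full morphosyntactic tags (part of speech plus case, number, gender, and so on) rather than the three coarse tags used here, which makes the tag set much larger and the counts much sparser.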

Similar resources

Machine Learning of Morphosyntactic Structure: Lemmatizing Unknown Slovene Words

Automatic lemmatization is a core application for many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma (base form) to each word in a running text is not trivial, since for instance, nouns inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, sin...

Evaluating Morphosyntactic Tagging of Croatian Texts

This paper describes results of the first successful effort in applying a stochastic strategy – or, namely, a second order Markov model paradigm implemented by the TnT trigram tagger – to morphosyntactic tagging of Croatian texts. Beside the tagger, for purposes of both training and testing, we had at our disposal only a 100 Kw Croatia Weekly newspaper subcorpus, manually tagged using approxima...

Learning to Lemmatise Slovene Words

Automatic lemmatisation is a core application for many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma to each word in a running text is not trivial: nouns and adjectives, for instance, inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, as wo...

A Tiered CRF Tagger for Polish

In this paper we present a new approach to morphosyntactic tagging of Polish by bringing together Conditional Random Fields and tiered tagging. Our proposal also makes it possible to take advantage of a rich set of morphological features, which resort to an external morphological analyser. The proposed algorithm is implemented as a tagger for Polish. Evaluation of the tagger shows significant improvement ...

A Rule-Based Tagger for Polish Based on Genetic Algorithm

In the paper an approach to the construction of a rule-based morphosyntactic tagger for Polish is proposed. The core of the tagger consists of modules of rules (classification systems), acquired from the IPI PAN corpus by application of Genetic Algorithms. Each module is specialised in making decisions concerning different parts of a tag (a structure of attributes). The acquired rules are combined with l...

Publication date: 2004